Abstract

We consider the problem of modeling the con- 
tent structure of texts within a specific do- 
main, in terms of the topics the texts address 
and the order in which these topics appear. 
We first present an effective knowledge-lean 
method for learning content models from un- 
annotated documents, utilizing a novel adap- 
tation of algorithms for Hidden Markov Mod- 
els. We then apply our method to two com- 
plementary tasks: information ordering and ex- 
tractive summarization. Our experiments show 
that incorporating content models in these ap- 
plications yields substantial improvement over 
previously-proposed methods. 

Publication info: HLT-NAACL 2004: Proceedings of the
Main Conference, pp. 113-120.

1 Introduction 

The development and application of computational mod- 
els of text structure is a central concern in natural lan- 
guage processing. Document-level analysis of text struc- 
ture is an important instance of such work. Previ- 
ous research has sought to characterize texts in terms 
of domain-independent rhetorical elements, such as 
schema items (McKeown, 1985) or rhetorical relations
(Mann and Thompson, 1988; Marcu, 1997). The focus
of our work, however, is on an equally fundamental 
but domain-dependent dimension of the structure of text: 
content. 

Our use of the term "content" corresponds roughly 
to the notions of topic and topic change. We desire 
models that can specify, for example, that articles about 
earthquakes typically contain information about quake 
strength, location, and casualties, and that descriptions 
of casualties usually precede those of rescue efforts. But 
rather than manually determine the topics for a given 
domain, we take a distributional view, learning them
directly from un-annotated texts via analysis of word
distribution patterns. This idea dates back at least to 
Harris (1982), who claimed that "various types of [word]
recurrence patterns seem to characterize various types of 
discourse". Advantages of a distributional perspective in- 
clude both drastic reduction in human effort and recogni- 
tion of "topics" that might not occur to a human expert 
and yet, when explicitly modeled, aid in applications. 

Of course, the success of the distributional approach 
depends on the existence of recurrent patterns. In arbi- 
trary document collections, such patterns might be too 
variable to be easily detected by statistical means. How- 
ever, research has shown that texts from the same domain 
tend to exhibit high similarity (Wray, 2002). Cognitive
psychologists have long posited that this similarity is not 
accidental, arguing that formulaic text structure facilitates 
readers' comprehension and recall (Bartlett, 1932). 1

In this paper, we investigate the utility of domain- 
specific content models for representing topics and 
topic shifts. Content models are Hidden Markov 
Models (HMMs) wherein states correspond to types 
of information characteristic to the domain of in- 
terest (e.g., earthquake magnitude or previous earth- 
quake occurrences), and state transitions capture possible 
information-presentation orderings within that domain. 

We first describe an efficient, knowledge-lean method 
for learning both a set of topics and the relations be- 
tween topics directly from un-annotated documents. Our 
technique incorporates a novel adaptation of the standard 
HMM induction algorithm that is tailored to the task of 
modeling content. 

Then, we apply techniques based on content models to 
two complex text-processing tasks. First, we consider in- 
formation ordering, that is, choosing a sequence in which 
to present a pre-selected set of items; this is an essen- 
tial step in concept-to-text generation, multi-document 
summarization, and other text-synthesis problems. In our
experiments, content models outperform Lapata's (2003)
state-of-the-art ordering method by a wide margin: for
one domain and performance metric, the gap was 78
percentage points. Second, we consider extractive
summarization: the compression of a document by choosing
a subsequence of its sentences. For this task, we develop
a new content-model-based learning algorithm for
sentence selection. The resulting summaries yield 88%
match with human-written output, which compares
favorably to the 69% achieved by the standard "leading n
sentences" baseline.

1 But "formulaic" is not necessarily equivalent to "simple",
so automated approaches still offer advantages over manual
techniques, especially if one needs to model several domains.

The success of content models in these two comple- 
mentary tasks demonstrates their flexibility and effective- 
ness, and indicates that they are sufficiently expressive to 
represent important text properties. These observations, 
taken together with the fact that content models are con- 
ceptually intuitive and efficiently learnable from raw doc- 
ument collections, suggest that the formalism can prove 
useful in an even broader range of applications than we 
have considered here; exploring the options is an appeal- 
ing line of future research. 

2 Related Work 

Knowledge-rich methods Models employing manual 
crafting of (typically complex) representations of content 
have generally captured one of three types of knowledge 
(Rambow, 1990; Kittredge et al., 1991): domain knowledge
[e.g., that earthquakes have magnitudes];
domain-independent communication knowledge [e.g., that
describing an event usually entails specifying its location];
and domain communication knowledge [e.g., that Reuters
earthquake reports often conclude by listing previous
quakes 2]. Formalisms exemplifying each of these knowledge
types are DeJong's (1982) scripts, McKeown's
(1985) schemas, and Rambow's (1990) domain-specific
schemas, respectively.

In contrast, because our models are based on a dis- 
tributional view of content, they will freely incorporate 
information from all three categories as long as such in- 
formation is manifested as a recurrent pattern. Also, in 
comparison to the formalisms mentioned above, content 
models constitute a relatively impoverished representa- 
tion; but this actually contributes to the ease with which 
they can be learned, and our empirical results show that 
they are quite effective despite their simplicity. 

In recent work, Duboue and McKeown (2003) propose
a method for learning a content planner from a collec- 
tion of texts together with a domain-specific knowledge 
base, but our method applies to domains in which no such 
knowledge base has been supplied. 



Knowledge-lean approaches Distributional models
of content have appeared with some frequency in research
on text segmentation and topic-based language
modeling (Hearst, 1994; Beeferman et al., 1997;
Chen et al., 1998; Florian and Yarowsky, 1999;
Gildea and Hofmann, 1999; Iyer and Ostendorf, 1996;
Wu and Khudanpur, 2002). In fact, the methods we
employ for learning content models are quite closely
related to techniques proposed in that literature (see
Section 3 for more details).

However, language-modeling research — whose goal 
is to predict text probabilities — tends to treat topic as a 
useful auxiliary variable rather than a central concern; for 
example, topic-based distributional information is generally
interpolated with standard, non-topic-based n-gram
models to improve probability estimates. Our work, in 
contrast, treats content as a primary entity. In particular, 
our induction algorithms are designed with the explicit 
goal of modeling document content, which is why they 
differ from the standard Baum-Welch (or EM) algorithm 
for learning Hidden Markov Models even though content 
models are instances of HMMs. 

3 Model Construction 

We employ an iterative re-estimation procedure that al- 
ternates between (1) creating clusters of text spans with 
similar word distributions to serve as representatives of 
within-document topics, and (2) computing models of 
word distributions and topic changes from the clusters so 
derived. 3 

Formalism preliminaries We treat texts as sequences 
of pre-defined text spans, each presumed to convey infor- 
mation about a single topic. Specifying text-span length 
thus defines the granularity of the induced topics. For 
concreteness, in what follows we will refer to "sentences" 
rather than "text spans" since that is what we used in our 
experiments, but paragraphs or clauses could potentially 
have been employed instead. 

Our working assumption is that all texts from a given 
domain are generated by a single content model. A con- 
tent model is an HMM in which each state s corresponds 
to a distinct topic and generates sentences relevant to that 
topic according to a state-specific language model $p_s$;
note that standard n-gram language models can there- 
fore be considered to be degenerate (single-state) content 
models. State transition probabilities give the probability 
of changing from a given topic to another, thereby capturing
constraints on topic shifts. We can use the forward
algorithm to efficiently compute the generation probability
assigned to a document by a content model and the
Viterbi algorithm to quickly find the most likely content-model
state sequence to have generated a given document;
see Rabiner (1989) for details.

2 This does not qualify as domain knowledge because it is
not about earthquakes per se.

3 For clarity, we omit minor technical details, such as the use
of dummy initial and final states. Section 5.2 describes how the
free parameters $k$, $T$, $\delta_1$, and $\delta_2$ are chosen.

The Athens seismological institute said the temblor's epicenter
was located 380 kilometers (238 miles) south of
the capital.

Seismologists in Pakistan's Northwest Frontier Province
said the temblor's epicenter was about 250 kilometers
(155 miles) north of the provincial capital Peshawar.

The temblor was centered 60 kilometers (35 miles) northwest
of the provincial capital of Kunming, about 2,200
kilometers (1,300 miles) southwest of Beijing, a bureau
seismologist said.

Figure 1: Samples from an earthquake-articles sentence
cluster, corresponding to descriptions of location.

In our implementation, we use bigram language models,
so that the probability of an $n$-word sentence
$x = w_1 w_2 \cdots w_n$ being generated by a state $s$ is

$$p_s(x) = \prod_{i=1}^{n} p_s(w_i \mid w_{i-1}).$$

Estimating the state bigram probabilities $p_s(w' \mid w)$
is described below.

Initial topic induction As in previous work
(Florian and Yarowsky, 1999; Iyer and Ostendorf, 1996;
Wu and Khudanpur, 2002), we initialize the set of "topics",
distributionally construed, by partitioning all of the
sentences from the documents in a given domain-specific
collection into clusters. First, we create $k$ clusters via
complete-link clustering, measuring sentence similarity
by the cosine metric using word bigrams as features
(Figure 1 shows example output). 4 Then, given our
knowledge that documents may sometimes discuss new 
and/or irrelevant content as well, we create an "etcetera" 
cluster by merging together all clusters containing fewer 
than T sentences, on the assumption that such clusters 
consist of "outlier" sentences. We use m to denote the 
number of clusters that results. 
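
The initialization step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, tokenization is plain whitespace splitting, the clustering is a brute-force greedy procedure, and the replacement of names, numbers and dates by generic tokens is omitted.

```python
import math
from collections import Counter

def bigram_vector(sentence):
    """Bag of word bigrams for one whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return Counter(zip(words, words[1:]))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def complete_link(sentences, k):
    """Greedily merge clusters (lists of sentence indices) until k remain."""
    vectors = [bigram_vector(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def linkage(a, b):
        # Complete link: two clusters are as similar as their least similar pair.
        return min(cosine(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = max(pairs, key=lambda p: p[0])
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

def merge_small(clusters, threshold):
    """Pool clusters with fewer than `threshold` sentences into a final
    "etcetera" cluster; the returned list has m entries."""
    kept = [c for c in clusters if len(c) >= threshold]
    etcetera = [i for c in clusters if len(c) < threshold for i in c]
    return kept + [etcetera]

# Toy data (invented for illustration only).
toy_sentences = [
    "the quake struck the city",
    "the quake struck the town",
    "rescue teams arrived at dawn",
    "rescue teams arrived at noon",
    "markets fell",
]
toy_clusters = merge_small(complete_link(toy_sentences, k=3), threshold=2)
```

On the toy data the two location-like and two rescue-like sentences form their own clusters, and the lone remaining sentence ends up in the "etcetera" cluster, so $m = 3$.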

Determining states, emission probabilities, and transition
probabilities Given a set $c_1, c_2, \ldots, c_m$ of $m$ clusters,
where $c_m$ is the "etcetera" cluster, we construct a
content model with corresponding states $s_1, s_2, \ldots, s_m$;
we refer to $s_m$ as the insertion state.

For each state $s_i$, $i < m$, bigram probabilities (which
induce the state's sentence-emission probabilities) are
estimated using smoothed counts from the corresponding
cluster:

$$p_{s_i}(w' \mid w) = \frac{f_{c_i}(w w') + \delta_1}{f_{c_i}(w) + \delta_1 |V|},$$

where $f_{c_i}(y)$ is the frequency with which word sequence
$y$ occurs within the sentences in cluster $c_i$, and $V$ is the
vocabulary. But because we want the insertion state $s_m$
to model digressions or unseen topics, we take the novel
step of forcing its language model to be complementary
to those of the other states by setting

$$p_{s_m}(w' \mid w) \stackrel{\text{def}}{=} \frac{1 - \max_{i: i < m} p_{s_i}(w' \mid w)}{\sum_{u \in V} \left(1 - \max_{i: i < m} p_{s_i}(u \mid w)\right)}.$$

Note that the contents of the "etcetera" cluster are ignored
at this stage.

4 Following Barzilay and Lee (2003), proper names, numbers
and dates are (temporarily) replaced with generic tokens to
help ensure that clusters contain sentences describing the same
event type, rather than same actual event.
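
As a rough sketch of these two estimates, the following code builds a smoothed bigram model for an ordinary topic state and then derives an insertion-state model that favors bigrams the topic states consider unlikely. All names are hypothetical, `delta1` stands for the bigram smoothing parameter, and one assumption is made that the text does not state: the history count is taken over non-final word occurrences so that each conditional distribution sums to one without dummy end-of-sentence tokens.

```python
from collections import Counter

def bigram_counts(cluster):
    """Count history words and word bigrams over tokenized sentences."""
    history, pairs = Counter(), Counter()
    for words in cluster:
        for w, w_next in zip(words, words[1:]):
            history[w] += 1          # assumption: count w only as a history
            pairs[(w, w_next)] += 1
    return history, pairs

def emission(history, pairs, vocab_size, delta1):
    """Return a function p(w_next, w) for one topic state."""
    def p(w_next, w):
        return ((pairs[(w, w_next)] + delta1)
                / (history[w] + delta1 * vocab_size))
    return p

def insertion(state_models, vocab):
    """Return p(w_next, w) for the insertion state: one minus the largest
    probability any topic state gives the bigram, renormalized over vocab."""
    def p(w_next, w):
        def gap(u):
            return 1.0 - max(model(u, w) for model in state_models)
        return gap(w_next) / sum(gap(u) for u in vocab)
    return p

# Toy cluster over a three-word vocabulary (invented for illustration).
toy_vocab = ["a", "b", "c"]
toy_history, toy_pairs = bigram_counts([["a", "b"], ["a", "b"], ["a", "c"]])
topic_model = emission(toy_history, toy_pairs, len(toy_vocab), delta1=1.0)
insertion_model = insertion([topic_model], toy_vocab)
```

In the toy example the topic state prefers "b" after "a", so the insertion state gives "b" the lowest probability after "a" and the unseen continuation "a" the highest.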

Our state-transition probability estimates arise from
considering how sentences from the same article are
distributed across the clusters. More specifically, for two
clusters $c$ and $c'$, let $D(c, c')$ be the number of documents
in which a sentence from $c$ immediately precedes one
from $c'$, and let $D(c)$ be the number of documents
containing sentences from $c$. Then, for any two states $s_i$ and
$s_j$, $i, j \le m$, we use the following smoothed estimate of
the probability of transitioning from $s_i$ to $s_j$:

$$p(s_j \mid s_i) = \frac{D(c_i, c_j) + \delta_2}{D(c_i) + \delta_2 m}.$$
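
The document-level counting behind this estimate can be illustrated with a short sketch. The function name and the input representation (one list of cluster indices per document, in sentence order) are assumptions made for the example, and `delta2` stands for the transition smoothing parameter.

```python
from collections import Counter

def transition_estimates(documents, m, delta2):
    """documents: for each document, the cluster index (0..m-1) of each
    sentence in order. Returns an m x m table of smoothed estimates."""
    contains = Counter()  # number of documents with a sentence from c
    precedes = Counter()  # number of documents where c directly precedes c'
    for labels in documents:
        # Sets ensure each document is counted at most once per cluster/pair.
        contains.update(set(labels))
        precedes.update(set(zip(labels, labels[1:])))
    return [[(precedes[(i, j)] + delta2) / (contains[i] + delta2 * m)
             for j in range(m)]
            for i in range(m)]

# Three toy documents over m = 3 clusters (invented for illustration).
toy_docs = [[0, 1, 2], [0, 1, 1, 2], [0, 2]]
toy_trans = transition_estimates(toy_docs, m=3, delta2=0.5)
```

Because the counts are over documents rather than over individual sentence pairs, a row of the resulting table need not sum exactly to one; an implementation that requires proper distributions could renormalize each row.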



Viterbi re-estimation Our initial clustering ignores 
sentence order; however, contextual clues may indicate 
that sentences with high lexical similarity are actually on 
different "topics". For instance, Reuters articles about 
earthquakes frequently finish by mentioning previous 
quakes. This means that while the sentence "The temblor 
injured dozens" at the beginning of a report is probably 
highly salient and should be included in a summary of it, 
the same sentence at the end of the piece probably refers 
to a different event, and so should be omitted. 

A natural way to incorporate ordering information is 
iterative re-estimation of the model parameters, since the 
content model itself provides such information through 
its transition structure. We take an EM-like Viterbi approach
(Iyer and Ostendorf, 1996): we re-cluster the sentences
by placing each one in the (new) cluster $c_i$, $i \le m$,
that corresponds to the state $s_i$ most likely to have generated
it according to the Viterbi decoding of the training
data. We then use this new clustering as the input to
the procedure for estimating HMM parameters described 
above. The cluster/estimate cycle is repeated until the 
clusterings stabilize or we reach a predefined number of 
iterations. 
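
One re-clustering pass might look like the sketch below, which decodes each document with a standard log-space Viterbi routine and regroups sentences by decoded state. The function names, the callable `log_emit(state, sentence)`, and the toy two-state model are all hypothetical; dummy initial and final states are replaced here by an explicit start distribution.

```python
import math

def viterbi(sentences, log_emit, log_trans, log_start):
    """Most likely state sequence for one document (log-space)."""
    m = len(log_start)
    score = [log_start[s] + log_emit(s, sentences[0]) for s in range(m)]
    back = []
    for sentence in sentences[1:]:
        pointers, new_score = [], []
        for s in range(m):
            best = max(range(m), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s]
                             + log_emit(s, sentence))
        back.append(pointers)
        score = new_score
    state = max(range(m), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def recluster(documents, log_emit, log_trans, log_start):
    """Place every sentence in the cluster of its Viterbi state."""
    clusters = [[] for _ in log_start]
    for doc in documents:
        states = viterbi(doc, log_emit, log_trans, log_start)
        for sentence, state in zip(doc, states):
            clusters[state].append(sentence)
    return clusters

# Toy model: state 0 prefers "quake", state 1 prefers "rescue", and the
# model rarely returns from state 1 to state 0 (all values invented).
def toy_log_emit(state, sentence):
    match = ((state == 0 and "quake" in sentence)
             or (state == 1 and "rescue" in sentence))
    return math.log(0.6 if match else 0.4)

toy_log_trans = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.01), math.log(0.99)]]
toy_log_start = [math.log(0.9), math.log(0.1)]
toy_doc = ["quake", "rescue", "quake"]
toy_path = viterbi(toy_doc, toy_log_emit, toy_log_trans, toy_log_start)
```

In the toy run the final "quake" sentence is assigned to state 1 even though its words favor state 0, because the transition structure makes a return to state 0 unlikely. This mirrors the point above that lexically similar sentences at different positions can belong to different topics.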

4 Evaluation Tasks 

We apply the techniques just described to two tasks that 
stand to benefit from models of content and changes in 
topic: information ordering for text generation and in- 
formation selection for single-document summarization. 
These are two complementary tasks that rely on dis- 
joint model functionalities: the ability to order a set of 
pre-selected information-bearing items, and the ability
to do the selection itself, extracting from an ordered sequence
of information-bearing items a representative subsequence.

4.1 Information Ordering 

The information-ordering task is essential to many
text-synthesis applications, including concept-to-text generation
and multi-document summarization. While accounting
for the full range of discourse and stylistic factors that
influence the ordering process is infeasible in many do- 
mains, probabilistic content models provide a means for 
handling important aspects of this problem. We demon- 
strate this point by utilizing content models to select ap- 
propriate sentence orderings: we simply use a content 
model trained on documents from the domain of interest, 
selecting the ordering among all the presented candidates 
that the content model assigns the highest probability to. 
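
A minimal sketch of this selection step is given below, assuming the probability of a candidate ordering is its generation probability under the forward algorithm. The function names and the toy two-state model are hypothetical, and an explicit start distribution stands in for a dummy initial state.

```python
import math
from itertools import permutations

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_log_prob(sentences, log_emit, log_trans, log_start):
    """Log generation probability of a sentence sequence under the HMM."""
    m = len(log_start)
    alpha = [log_start[s] + log_emit(s, sentences[0]) for s in range(m)]
    for sentence in sentences[1:]:
        alpha = [logsumexp([alpha[r] + log_trans[r][s] for r in range(m)])
                 + log_emit(s, sentence)
                 for s in range(m)]
    return logsumexp(alpha)

def best_ordering(candidates, log_emit, log_trans, log_start):
    """Return the candidate ordering with the highest model probability."""
    return max(candidates,
               key=lambda c: forward_log_prob(c, log_emit, log_trans, log_start))

# Toy model: state 0 prefers "quake", state 1 prefers "rescue", and state 1
# rarely transitions back to state 0 (all values invented).
def toy_log_emit(state, sentence):
    match = ((state == 0 and "quake" in sentence)
             or (state == 1 and "rescue" in sentence))
    return math.log(0.6 if match else 0.4)

toy_log_trans = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.01), math.log(0.99)]]
toy_log_start = [math.log(0.9), math.log(0.1)]
toy_candidates = list(permutations(["quake", "rescue"]))
toy_best = best_ordering(toy_candidates, toy_log_emit,
                         toy_log_trans, toy_log_start)
```

Here the ordering that presents the quake sentence before the rescue sentence receives the higher probability, since it agrees with both the start distribution and the transition structure of the toy model.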

4.2 Extractive Summarization 

Content models can also be used for single-document 
summarization. Because ordering is not an issue in this 
application 5 , this task tests the ability of content models 
to adequately represent domain topics independently of 
whether they do well at ordering these topics. 

The usual strategy employed by domain-specific sum- 
marizers is for humans to determine a priori what 
types of information from the originating documents 
should be included (e.g., in stories about earthquakes, 
the number of victims) (Radev and McKeown, 1998;
White et al., 2001). Some systems avoid the need
for manual analysis by learning content-selection rules 
from a collection of articles paired with human- 
authored summaries, but their learning algorithms typ- 
ically focus on within-sentence features or very coarse 
structural features (such as position within a paragraph)
(Kupiec et al., 1999). Our content-model-based
summarization algorithm combines the advantages of 
both approaches; on the one hand, it learns all required in- 
formation from un-annotated document-summary pairs; 
on the other hand, it operates on a more abstract and 
global level, making use of the topical structure of the 
entire document. 

Our algorithm is trained as follows. Given a content 
model acquired from the full articles using the method described
in Section 3, we need to learn which topics (represented
by the content model's states) should appear in
our summaries. Our first step is to employ the Viterbi al- 
gorithm to tag all of the summary sentences and all of the 
sentences from the original articles with a Viterbi topic 
label, or V-topic — the name of the state most likely to 
have generated them. Next, for each state $s$ such that
at least three full training-set articles contained V-topic
$s$, we compute the probability that the state generates
sentences that should appear in a summary. This probability
is estimated by simply (1) counting the number
of document-summary pairs in the parallel training data
such that both the originating document and the summary
contain sentences assigned V-topic $s$, and then (2) normalizing
this count by the number of full articles containing
sentences with V-topic $s$.

5 Typically, sentences in a single-document summary follow
the order of appearance in the original document.

Domain        Average   Standard    Vocabulary   Token/
              Length    Deviation   Size         type
Earthquakes   10.4      5.2         1182         13.2
Clashes       14.0      2.6         1302         4.5
Drugs         10.3      7.5         1566         4.1
Finance       13.7      1.6         1378         12.8
Accidents     11.5      6.3         2003         5.6

Table 1: Corpus statistics. Length is in sentences. Vocabulary
size and token/type ratio are computed after replacement
of proper names, numbers and dates.

To produce a length-$\ell$ summary of a new article, the algorithm
first uses the content model and Viterbi decoding
to assign each of the article's sentences a V-topic. Next,
the algorithm selects those $\ell$ states, chosen from among
those that appear as the V-topic of one of the article's 
sentences, that have the highest probability of generating 
a summary sentence, as estimated above. Sentences from 
the input article corresponding to these states are placed 
in the output summary. 6 
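
The training and selection steps can be sketched as follows, assuming V-topic labels have already been produced by Viterbi decoding. The function names and the input representation are hypothetical. As a simplification, the selection ranks sentences directly by the summary probability of their V-topic, breaks ties by position, and emits the chosen sentences in document order, which yields the same sentences as choosing the top states first except where states are tied.

```python
from collections import Counter

def summary_probabilities(pairs, min_articles=3):
    """pairs: (article_vtopics, summary_vtopics) label lists, one per
    document-summary pair. Returns P(V-topic appears in summary)."""
    in_article, in_both = Counter(), Counter()
    for article, summary in pairs:
        for state in set(article):
            in_article[state] += 1
            if state in summary:
                in_both[state] += 1
    # Only states seen in at least `min_articles` full articles are scored.
    return {state: in_both[state] / n
            for state, n in in_article.items() if n >= min_articles}

def summarize(sentences, vtopics, probs, length):
    """Select `length` sentences whose V-topics are most summary-worthy."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-probs.get(vtopics[i], 0.0), i))
    return [sentences[i] for i in sorted(ranked[:length])]

# Toy training pairs with V-topics named by letters (invented).
toy_pairs = [
    (["a", "b", "c"], ["a", "b"]),
    (["a", "b", "c"], ["a"]),
    (["a", "b", "c", "d"], ["a", "c"]),
]
toy_probs = summary_probabilities(toy_pairs)
toy_summary = summarize(["s0", "s1", "s2", "s3"],
                        ["c", "a", "b", "d"], toy_probs, length=2)
```

In the toy data V-topic "a" appears in every summary and so always wins a slot, while "d" occurs in only one article and is never scored.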

5 Evaluation Experiments 
5.1 Data 

For evaluation purposes, we created corpora from five 
domains: earthquakes, clashes between armies and rebel 
groups, drug-related criminal offenses, financial reports, 
and summaries of aviation accidents. 7 Specifically, the 
first four collections consist of AP articles from the North 
American News Corpus gathered via a TDT-style docu- 
ment clustering system. The fifth consists of narratives 
from the National Transportation Safety Board's database 
previously employed by Jones and Thompson (2003) for
event-identification experiments. For each such set, 100 
articles were used for training a content model, 100 articles
for testing, and 20 for the development set used for
parameter tuning. Table 1 presents information about article
length (measured in sentences, as determined by the
sentence separator of Reynar and Ratnaparkhi (1997)), 
vocabulary size, and token/type ratio for each domain. 

6 If there are more than $\ell$ sentences, we prioritize them by
the summarization probability of their V-topic's state; we break
any further ties by order of appearance in the document.

7 http://www.sls.csail.mit.edu/~regina/struct



5.2 Parameter Estimation 

Our training algorithm has four free parameters: two that 
indirectly control the number of states in the induced con- 
tent model, and two parameters for smoothing bigram 
probabilities. All were tuned separately for each domain
on the corresponding held-out development set using
Powell's grid search (Press et al., 1997). The parameter
values were selected to optimize system performance
on the information-ordering task. 8 We found that across
all domains, the optimal models were based on "sharper"
language models (e.g., $\delta_1 < 0.0000001$). The optimal
number of states ranged from 32 to 95. 

5.3 Ordering Experiments 
5.3.1 Metrics 

The intent behind our ordering experiments is to test 
whether content models assign high probability to ac- 
ceptable sentence arrangements. However, one stumbling 
block to performing this kind of evaluation is that we do 
not have data on ordering quality: the set of sentences 
from an $N$-sentence document can be sequenced in $N!$
different ways, which even for a single text of moder- 
ate length is too many to ask humans to evaluate. For- 
tunately, we do know that at least the original sentence 
order (OSO) in the source document must be acceptable, 
and so we should prefer algorithms that assign it high 
probability relative to the bulk of all the other possible 
permutations. This observation motivates our first evalu- 
ation metric: the rank received by the OSO when all per- 
mutations of a given document's sentences are sorted by 
the probabilities that the model under consideration as- 
signs to them. The best possible rank is 0, and the worst 
is $N! - 1$.
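
For short documents the rank metric can be computed by brute force, as in the sketch below. The scoring callable `log_prob` is a placeholder for whatever model is being evaluated, the toy scorer is invented, and tied permutations are resolved in favor of the original order, which is an assumption the text does not address.

```python
from itertools import permutations

def oso_rank(sentences, log_prob):
    """Rank of the original sentence order: the number of permutations
    the model scores strictly higher (0 is best)."""
    original = log_prob(tuple(sentences))
    return sum(1 for p in permutations(sentences) if log_prob(p) > original)

def toy_log_prob(order):
    # Hypothetical scorer: penalize every pair out of alphabetical order.
    return -sum(1 for i in range(len(order))
                for j in range(i + 1, len(order)) if order[i] > order[j])

best_case = oso_rank(["a", "b", "c"], toy_log_prob)
worst_case = oso_rank(["c", "b", "a"], toy_log_prob)
```

With three sentences there are six permutations, so the toy scorer gives rank 0 when the original order is the one it prefers and rank 5 when the original order is the one it likes least. Exhaustive enumeration grows factorially and is only practical for documents of modest length.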

An additional difficulty we encountered in setting up 
our evaluation is that while we wanted to compare our 
algorithms against Lapata's (2003) state-of-the-art system,
her method doesn't consider all permutations (see
below), and so the rank metric cannot be computed for it. 
To compensate, we report the OSO prediction rate, which 
measures the percentage of test cases in which the model 
under consideration gives highest probability to the OSO 
from among all possible permutations; we expect that a 
good model should predict the OSO a fair fraction of the 
time. Furthermore, to provide some assessment of the 
quality of the predicted orderings themselves, we follow 
Lapata (2003) in employing Kendall's $\tau$, which is a measure
of how much an ordering differs from the OSO;
the underlying assumption is that most reasonable sentence
orderings should be fairly similar to it. Specifically,
for a permutation $\sigma$ of the sentences in an $N$-sentence
document, $\tau(\sigma)$ is computed as

$$\tau(\sigma) = 1 - \frac{2\,S(\sigma)}{\binom{N}{2}},$$

where $S(\sigma)$ is the number of swaps of adjacent sentences
necessary to re-arrange $\sigma$ into the OSO. The metric
ranges from -1 (inverse orders) to 1 (identical orders).

8 See Section 5.5 for discussion of the relation between the
ordering and the summarization task.
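
A small sketch of this metric follows. It relies on the standard fact that the minimum number of adjacent swaps needed to sort a sequence equals its number of inversions; the function name and the input encoding (each position holds the sentence's index in the original order) are choices made for the example, and at least two sentences are assumed.

```python
def kendall_tau(ordering):
    """ordering[i] is the original position of the sentence placed i-th.
    Returns 1 for the original order and -1 for its exact reverse."""
    n = len(ordering)
    # Inversions = minimum number of adjacent swaps to restore the OSO.
    swaps = sum(1 for i in range(n) for j in range(i + 1, n)
                if ordering[i] > ordering[j])
    return 1.0 - 2.0 * swaps / (n * (n - 1) / 2)

identical = kendall_tau([0, 1, 2, 3])
reversed_order = kendall_tau([3, 2, 1, 0])
one_swap = kendall_tau([1, 0, 2, 3])
```

For four sentences there are six sentence pairs, so a single adjacent swap lowers the score from 1 to 1 - 2/6, or about 0.67.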

5.3.2 Results 

For each of the 500 unseen test texts, we exhaustively 
enumerated all sentence permutations and ranked them 
using a content model from the corresponding domain. 
We compared our results against those of a bigram lan- 
guage model (the baseline) and an improved version of 
the state-of-the-art probabilistic ordering method of Lapata
(2003), both trained on the same data we used.
Lapata's method first learns a set of pairwise sentence- 
ordering preferences based on features such as noun-verb 
dependencies. Given a new set of sentences, the latest 
version of her method applies a Viterbi-style approxima- 
tion algorithm to choose a permutation satisfying many 
preferences (Lapata, personal communication). 9 

Table 2 gives the results of our ordering-test comparison
experiments. Content models outperform the alternatives
almost universally, and often by a very wide margin.
We conjecture that this difference in performance stems 
from the ability of content models to capture global doc- 
ument structure. In contrast, the other two algorithms 
are local, taking into account only the relationships be- 
tween adjacent word pairs and adjacent sentence pairs, 
respectively. It is interesting to observe that our method 
achieves better results despite not having access to the lin- 
guistic information incorporated by Lapata's method. To 
be fair, though, her techniques were designed for a larger 
corpus than ours, which may aggravate data sparseness 
problems for such a feature-rich method. 

Table 3 gives further details on the rank results for our
content models, showing how the rank scores were dis- 
tributed; for instance, we see that on the Earthquakes do- 
main, the OSO was one of the top five permutations in 
95% of the test documents. Even in Drugs and Accidents 
— the domains that proved relatively challenging to our 
method — in more than 55% of the cases the OSO's rank 
did not exceed ten. Given that the maximal possible rank 
in these domains exceeds three million, we believe that 
our model has done a good job in the ordering task. 

We also computed learning curves for the different domains;
these are shown in Figure 2. Not surprisingly, performance
improves with the size of the training set for all
domains. The figure also shows that the relative difficulty 
(from the content-model point of view) of the different 
domains remains mostly constant across varying training-set
sizes. Interestingly, the two easiest domains, Finance
and Earthquakes, can be thought of as being more formulaic
or at least more redundant, in that they have the
highest token/type ratios (see Table 1); that is, in these
domains, words are repeated much more frequently on
average.

9 Finding the optimal such permutation is NP-complete.

Domain        System    Rank          OSO pred.     $\tau$
Earthquakes   Content   2.67          72%           0.81
              Lapata    (N/A)         [illegible]   [illegible]
              Bigram    [illegible]   [illegible]   [illegible]
Clashes       Content   3.05          48%           0.64
              Lapata    (N/A)         [illegible]   0.41
              Bigram    [illegible]   12%           [illegible]
Drugs         Content   15.38         38%           0.45
              Lapata    (N/A)         [illegible]   [illegible]
              Bigram    [illegible]   11%           0.24
Finance       Content   0.05          96%           0.98
              Lapata    (N/A)         18%           0.75
              Bigram    7.44          66%           0.74
Accidents     Content   10.96         41%           0.44
              Lapata    (N/A)         10%           0.07
              Bigram    973.75        2%            0.19

Table 2: Ordering results (averages over the test cases).

Domain        Rank range
              [0-4]     [5-10]    > 10
Earthquakes   95%       1%        4%
Clashes       75%       18%       7%
Drugs         47%       8%        45%
Finance       100%      0%        0%
Accidents     52%       7%        41%

Table 3: Percentage of cases for which the content model
assigned to the OSO a rank within a given range.

5.4 Summarization Experiments 

The evaluation of our summarization algorithm was 
driven by two questions: (1) Are the summaries produced 
of acceptable quality, in terms of selected content? and 
(2) Does the content-model representation provide addi- 
tional advantages over more locally-focused methods? 

To address the first question, we compare summaries
created by our system against the "lead" baseline, which
extracts the first $\ell$ sentences of the original text. Despite
the simplicity of this baseline, results from the annual
Document Understanding Conference (DUC) evaluation suggest
that most single-document summarization systems
cannot beat it. To address question (2), we
consider a summarization system that learns extraction
rules directly from a parallel corpus of full texts and their
summaries (Kupiec et al., 1999). In this system, summarization
is framed as a sentence-level binary classification
problem: each sentence is labeled by the publicly-available
BoosTexter system (Schapire and Singer, 2000)
as being either "in" or "out" of the summary. The features
considered for each sentence are its unigrams and
its location within the text, namely beginning third, middle
third and end third. 10 Hence, relationships between
sentences are not explicitly modeled, making this system
a good basis for comparison.

Figure 2: Ordering-task performance, in terms of OSO
prediction rate, as a function of the number of documents
in the training set.

We evaluated our summarization system on the Earth- 
quakes domain, since for some of the texts in this domain 
there is a condensed version written by AP journalists. 
These summaries are mostly extractive 11 ; consequently, 
they can be easily aligned with sentences in the original 
articles. From sixty document-summary pairs, half were 
randomly selected to be used for training and the other 
half for testing. (While thirty documents may not seem 
like a large number, it is comparable to the size of the 
training corpora used in the competitive summarization- 
system evaluations mentioned above.) The average num- 
ber of sentences in the full texts and summaries was 15 
and 6, respectively, for a total of 450 sentences in each of 
the test and (full documents of the) training sets. 

At runtime, we provided the systems with a full doc- 
ument and the desired output length, namely, the length 
in sentences of the corresponding shortened version. The 
resulting summaries were judged as a whole by the frac- 
tion of their component sentences that appeared in the 
human-written summary of the input text. 

The results in Table 4 confirm our hypothesis about
the benefits of content models for text summarization:
our model outperforms both the sentence-level, locally-focused
classifier and the "lead" baseline. Furthermore,
as the learning curves shown in Figure 3 indicate, our
method achieves good performance on a small subset of
parallel training data: in fact, the accuracy of our method
on one third of the training data is higher than that of the
sentence-level classifier on the full training set. Clearly,
this performance gain demonstrates the effectiveness of
content models for the summarization task.

10 This feature set yielded the best results among the several
possibilities we tried.

11 Occasionally, one or two phrases or, more rarely, a clause
were dropped.

System                                   Extraction accuracy
Content-based                            88%
Sentence classifier (words + location)   76%
Leading n sentences                      69%

Table 4: Summarization-task results.

Figure 3: Summarization performance (extraction accuracy)
on Earthquakes as a function of training-set size.

5.5 Relation Between Ordering and Summarization 
Methods 

Since we used two somewhat orthogonal tasks, ordering 
and summarization, to evaluate the quality of the content- 
model paradigm, it is interesting to ask whether the same 
parameterization of the model does well in both cases. 
Specifically, we looked at the results for different model 
topologies, induced by varying the number of content- 
model states. For these tests, we experimented with the 
Earthquakes data (the only domain for which we could 
evaluate summarization performance), and exerted direct 
control over the number of states, rather than utilizing the 
cluster-size threshold; that is, in order to create exactly m 
states for a specific value of m, we merged the smallest 
clusters until m clusters remained. 

Table 5 shows the performance of the different-sized
content models with respect to the summarization task
and the ordering task (using OSO prediction rate). While
the ordering results seem to be more sensitive to the number
of states, both metrics induce similar ranking on the
models. In fact, the same-size model yields top performance
on both tasks. While our experiments are limited
to only one domain, the correlation in results is encouraging:
optimizing parameters on one task promises to yield
good performance on the other. These findings provide
support for the hypothesis that content models are not
only helpful for specific tasks, but can serve as effective
representations of text structure in general.

Model size      [missing]   20    40    60    64    80
Ordering        11%         28%   52%   50%   72%   57%
Summarization   54%         70%   79%   79%   88%   83%

Table 5: Content-model performance on Earthquakes as
a function of model size. Ordering: OSO prediction rate;
Summarization: extraction accuracy.

6 Conclusions 

In this paper, we present an unsupervised method for the 
induction of content models, which capture constraints 
on topic selection and organization for texts in a par- 
ticular domain. Incorporation of these models in order- 
ing and summarization applications yields substantial im- 
provement over previously-proposed methods. These re- 
sults indicate that distributional approaches widely used 
to model various inter-sentential phenomena can be suc- 
cessfully applied to capture text-level relations, empiri- 
cally validating the long-standing hypothesis that word 
distribution patterns strongly correlate with discourse 
patterns within a text, at least within specific domains. 

An important future direction lies in studying the cor- 
respondence between our domain-specific model and 
domain-independent formalisms, such as RST. By automatically
annotating a large corpus of texts with discourse
relations via a rhetorical parser (Marcu, 1997;
Soricut and Marcu, 2003), we may be able to incorporate
domain-independent relationships into the transition
structure of our content models. This study could uncover 
interesting connections between domain-specific stylistic 
constraints and generic principles of text organization. 

In the literature, discourse is frequently modeled using 
a hierarchical structure, which suggests that probabilis- 
tic context-free grammars or hierarchical Hidden Markov 
Models (Fine et al., 1998) may also be applied for modeling
content structure. In the future, we plan to investigate
how to bootstrap the induction of hierarchical models us- 
ing labeled data derived from our content models. We 
would also like to explore how domain-independent dis- 
course constraints can be used to guide the construction 
of the hierarchical models. 